Multi-Modal Stacking Ensemble for the Diagnosis of Cardiovascular Diseases

Background: Cardiovascular diseases (CVDs) are a leading cause of death worldwide. Deep learning methods have been widely used in the field of medical image analysis and have shown promising results in the diagnosis of CVDs. Methods: Experiments were performed on 12-lead electrocardiogram (ECG) databases collected by Chapman University and Shaoxing People’s Hospital. The ECG signal of each lead was converted into a scalogram image and an ECG grayscale image and used to fine-tune the pretrained ResNet-50 model of each lead. The ResNet-50 model was used as a base learner for the stacking ensemble method. Logistic regression, support vector machine, random forest, and XGBoost were used as a meta learner by combining the predictions of the base learner. The study introduced a method called multi-modal stacking ensemble, which involves training a meta learner through a stacking ensemble that combines predictions from two modalities: scalogram images and ECG grayscale images. Results: The multi-modal stacking ensemble with a combination of ResNet-50 and logistic regression achieved an AUC of 0.995, an accuracy of 93.97%, a sensitivity of 0.940, a precision of 0.937, and an F1-score of 0.936, which are higher than those of LSTM, BiLSTM, individual base learners, simple averaging ensemble, and single-modal stacking ensemble methods. Conclusion: The proposed multi-modal stacking ensemble approach showed effectiveness for diagnosing CVDs.


Introduction
Cardiovascular diseases (CVDs) are a global public health problem and result from a variety of causes. Because CVDs are multifactorial in origin, accurate and timely diagnosis is difficult [1]. Early and accurate diagnosis and treatment of CVDs can significantly reduce the risk of morbidity and mortality, making rapid and accurate CVD prediction a crucial task in healthcare. Cardiologists use various tools to diagnose CVDs, and one commonly used tool is the electrocardiogram (ECG). It enables quick detection of abnormal heart rhythms and potential signs of heart disease without any intervention [2,3]. In particular, the most frequently used complementary exam for cardiac evaluation is the standard short-duration 12-lead ECG (S12L-ECG), since it provides a comprehensive evaluation of the heart's electrical activity. The S12L-ECG system is therefore used in various medical environments, ranging from primary care centers to intensive care units [4,5].
However, the ECG signal is complex and can be affected by various factors, such as noise and motion artifacts [6], which makes it challenging to diagnose CVDs accurately. One way to overcome this limitation is to apply deep learning methods, which have been used to improve the accuracy of CVD diagnosis by automatically learning relevant features from the ECG signal. Among the deep learning techniques used to detect CVDs, recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent units (GRU) have been extensively employed [7][8][9]. Faust et al. used a bidirectional LSTM (BiLSTM) to identify atrial fibrillation beats in heart BiLSTM, 12 individual base learners, simple averaging ensemble, and the single-modal stacking ensemble.

Dataset and Preprocessing
The dataset used in this research was a 12-lead ECG database collected by Chapman University and Shaoxing People's Hospital in China [6]. The database was recorded at a sampling frequency of 500 Hz and consisted of recordings from 10,646 patients (including 5956 males), each lasting 10 s. It contained 11 different heart rhythms labelled by professional physicians. Since raw ECG signals contain unwanted noise, three preprocessing steps were applied sequentially: a Butterworth lowpass filter (LPF), local polynomial regression smoother (LOESS) curve fitting, and the non-local means (NLM) technique [21][22][23]. The Butterworth LPF was used to remove signals with frequencies above the typical frequency range of a normal ECG (0.5 Hz to 50 Hz). The LOESS curve fitting method was used to eliminate the baseline wandering that can be caused by respiration. The NLM technique was employed to reduce residual noise. Of the ECG data, 58 recordings were excluded because they either contained only zeros or had incomplete channel values. Among the remaining 10,588 recordings, the atrial tachycardia (AT), atrioventricular node reentry tachycardia (AVNRT), atrioventricular reentry tachycardia (AVRT), and sinus atrial-to-atrial wander rhythm (SAAWR) categories contained only 121, 16, 8, and 7 samples, respectively. Because these four categories were extremely small, they were excluded from this study. Finally, a total of 10,436 ECG recordings belonging to 7 ECG rhythms were used. Table 1 describes the 7 ECG rhythms along with the corresponding number of subjects.
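The exclusion bookkeeping above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts stated in the text; the variable and function names are illustrative and do not come from the authors' code.

```python
# Counts as reported in the text; all names below are illustrative.
TOTAL_RECORDINGS = 10_646
UNUSABLE_RECORDINGS = 58  # all-zero or incomplete channel values
RARE_RHYTHM_COUNTS = {"AT": 121, "AVNRT": 16, "AVRT": 8, "SAAWR": 7}

SAMPLING_RATE_HZ = 500
DURATION_S = 10
SAMPLES_PER_LEAD = SAMPLING_RATE_HZ * DURATION_S  # samples in one lead


def remaining_after_exclusions(total, unusable, rare_counts):
    """Return (count after quality filtering, count after dropping rare rhythms)."""
    after_quality = total - unusable
    after_rare = after_quality - sum(rare_counts.values())
    return after_quality, after_rare


after_quality, final_count = remaining_after_exclusions(
    TOTAL_RECORDINGS, UNUSABLE_RECORDINGS, RARE_RHYTHM_COUNTS
)
print(after_quality, final_count, SAMPLES_PER_LEAD)  # 10588 10436 5000
```

The 5000 samples per lead match the signal length used for the wavelet transform in the next section.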

Data Transformation
To utilize the 2D CNN model, it is necessary to transform the 1D ECG signal into a 2D image. Among the various methods of converting to a 2D image, we adopted a method of converting to a scalogram and a method of plotting a 1D ECG signal as it is in two dimensions. In this study, we refer to the former image as a scalogram image and the latter image as an ECG grayscale image. An ECG scalogram image is a visual representation of the time-frequency composition of the ECG signal that can reveal important information about the frequency characteristics of the ECG over time. Scalogram images were generated by applying the continuous wavelet transform (CWT) to the ECG recordings. An analytic Morse wavelet with a symmetry parameter of 3 (γ = 3) and a time-bandwidth product of 60 (P² = 60) was used to obtain the CWT. The Morse wavelet is perfectly symmetric in the frequency domain and has zero skewness when γ equals 3. The CWT was calculated using 10 voices per octave, a 500 Hz sampling frequency, and a signal length of 5000. The minimum and maximum scales were determined automatically based on the wavelet's energy spread in time and frequency [24]. In this study, we used the cwt.m function provided by Wavelet Toolbox in Matlab 2020a (https://www.mathworks.com/help/wavelet/ref/cwt.html, accessed on 13 February 2023). The converted scalogram images were saved as 300 × 300 pixel RGB images. For ECG grayscale images, 1D ECG recordings were plotted as grayscale images with a white ECG signal against a black background. The ECG grayscale images were saved as 300 × 300 pixels. Examples of scalogram images and ECG grayscale images for the 7 groups (AFIB, AF, ST, SVT, SB, SR, and SI) are shown in Figure 1.
Figure 1: The first row shows scalogram images and the second row displays grayscale images of the ECG signals.
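To make the ECG grayscale image concrete, the sketch below rasterizes a 1D signal as a white trace on a black 300 × 300 canvas. It is only an illustration under stated assumptions: the text does not describe how the plots were rendered, so the min-max amplitude scaling, the nearest-sample column mapping, and the vertical joining of neighbouring columns are all choices made here, and the sine wave merely stands in for a real lead.

```python
import math


def signal_to_grayscale(signal, size=300):
    """Rasterize a 1D signal as a white (255) trace on a black (0) square image.

    Assumptions (not from the source): min-max amplitude scaling and
    nearest-sample selection for each image column.
    """
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    image = [[0] * size for _ in range(size)]
    n = len(signal)
    prev_row = None
    for col in range(size):
        # Pick the sample nearest to this column's position in time.
        idx = round(col * (n - 1) / (size - 1)) if n > 1 else 0
        # Row 0 is the top of the image, so large amplitudes get small rows.
        row = round((hi - signal[idx]) / span * (size - 1))
        # Join this point to the previous column's row so the trace is continuous.
        start, end = (row, row) if prev_row is None else sorted((prev_row, row))
        for r in range(start, end + 1):
            image[r][col] = 255
        prev_row = row
    return image


# Synthetic stand-in for one 10 s lead sampled at 500 Hz (5000 samples).
demo_signal = [math.sin(2 * math.pi * 5 * t / 5000) for t in range(5000)]
demo_image = signal_to_grayscale(demo_signal)
print(len(demo_image), len(demo_image[0]))  # 300 300
```

Mapping 5000 samples onto 300 columns discards detail between columns; a plotting library that draws every sample before resizing would preserve more of the waveform.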

Ensemble Methods
Ensemble methods are a group of techniques that combine the predictions of multiple models to improve performance. There are many ensemble methods, but this study adopts simple averaging ensemble and stacking ensemble. Simple averaging ensemble obtains the output by averaging the predictions of individual learners directly. Owing to its simplicity and effectiveness, the method is popular in many real applications. The stacking ensemble consists of multiple base learners and a meta-learner. In stacking ensemble, each base learner trains with the original training dataset and then generates new datasets for training a meta learner, where the outputs of the base learner are regarded as input features of the meta learner. The stacking ensemble is powerful because it can combine the strengths of different models to produce a more accurate prediction [20].
Since we propose a multi-modal stacking ensemble method for diagnosing CVDs, we focus on the stacking ensemble. In this study, we use two image modalities: scalogram images and ECG grayscale images. The single-modal stacking ensemble uses only one image modality, whereas the multi-modal stacking ensemble incorporates both. We first explain the single-modal stacking ensemble, using scalogram images as the input for concreteness. As shown in Figure 2, scalogram images are fed to a pretrained ResNet-50 model that is fine-tuned for each lead. Since there are 12 leads, 12 ResNet-50 base learners are fine-tuned with scalogram images, and each produces a 7-dimensional probability vector as its prediction. The simple averaging ensemble averages the predictions of the 12 independently trained single-lead ResNet-50 models. The stacking ensemble, on the other hand, concatenates the 7-dimensional output probability vectors from the 12 leads into an 84-dimensional vector, which is then fed into a meta learner that outputs prediction values for the 7 ECG rhythms. Logistic regression, support vector machines (SVM), random forest, and XGBoost were employed as the meta learner in this study [25][26][27][28]. The single-modal stacking ensemble architecture for ECG grayscale images is the same as described above, except that the input image is an ECG grayscale image instead of a scalogram image.
Whereas the single-modal stacking ensemble considers only one input modality, the multi-modal stacking ensemble takes multiple input modalities into account. The multi-modal stacking ensemble is depicted in detail in Figure 3. In this study, scalogram images and ECG grayscale images are the two input modalities. In the proposed multi-modal stacking ensemble, we combine the 84-dimensional vector obtained from the 12 base learners trained on scalogram images with the 84-dimensional vector obtained from the 12 base learners trained on ECG grayscale images. Concatenating the vectors from the two modalities yields a 168-dimensional vector that carries the characteristics of both the scalogram image and the ECG grayscale image, and this vector becomes the new input vector for the meta learner. As in the single-modal stacking ensemble, the multi-modal stacking ensemble employed logistic regression, SVM, random forest, and XGBoost as meta learners in this study.
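The difference between simple averaging and the two stacking variants comes down to how the 12 per-lead probability vectors are combined. The sketch below illustrates only the feature construction; the random vectors stand in for real ResNet-50 outputs, and all function names are hypothetical rather than taken from the authors' repository.

```python
import random

N_LEADS = 12
N_CLASSES = 7


def fake_probabilities(rng):
    """Placeholder for one base learner's 7-class probability output."""
    raw = [rng.random() for _ in range(N_CLASSES)]
    total = sum(raw)
    return [value / total for value in raw]


def simple_average(lead_probs):
    """Simple averaging ensemble: mean of the per-lead probability vectors."""
    return [
        sum(probs[c] for probs in lead_probs) / len(lead_probs)
        for c in range(N_CLASSES)
    ]


def stack_features(lead_probs):
    """Stacking input: concatenate the per-lead vectors (12 x 7 -> 84)."""
    return [value for probs in lead_probs for value in probs]


rng = random.Random(0)
scalogram_probs = [fake_probabilities(rng) for _ in range(N_LEADS)]
grayscale_probs = [fake_probabilities(rng) for _ in range(N_LEADS)]

averaged = simple_average(scalogram_probs)           # 7 values, used directly
single_modal = stack_features(scalogram_probs)       # 84 meta-learner features
multi_modal = single_modal + stack_features(grayscale_probs)  # 168 features

print(len(averaged), len(single_modal), len(multi_modal))  # 7 84 168
```

In the averaging ensemble the 7 averaged values are themselves the prediction, while in both stacking variants the concatenated vector is only an input, and the final prediction comes from the trained meta learner.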

LSTM
LSTM is a type of recurrent neural network and a powerful method for the diagnosis of CVDs. By training an LSTM model on labeled ECG recordings, the model can learn to detect patterns and features that are indicative of CVDs. Due to the sequential nature of ECG recordings, LSTM is well-suited for this task as it can capture long-term and temporal dependencies between individual ECG recordings. In this study, LSTM was applied to the same ECG dataset to demonstrate the effectiveness of the proposed multi-modal stacking ensemble method. LSTM has numerous hyperparameters; in our experiments, the batch size, hidden size, dropout, and number of epochs were fixed at 128, 128, 0.2, and 100, respectively. The Adam optimizer was used with β1 = 0.9 and β2 = 0.999 to optimize the LSTM model. To determine the learning rate and number of layers, a grid search was performed over learning rates of (1e-3, 1e-4, 5e-5, 1e-5) and layer counts of (2, 3, 4). The combination with the highest accuracy on the validation dataset was selected. To mitigate the vanishing gradient problem, the ECG signal sampled at 500 Hz was downsampled to 250 Hz. LSTM was trained on all 12 leads at the same time, since a 12-lead ECG signal can be represented as a sequence of 12-dimensional vectors with a length of T time samples. ResNet-50, on the other hand, was trained individually for each lead. BiLSTM can be seen as a variation of LSTM. Unlike LSTM, BiLSTM can analyze input sequences both forward and backward, which gives it the ability to comprehend information from past and future time-steps and identify complex inter-dependencies in the data. BiLSTM was also evaluated under the same conditions.
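The search space described above is small enough to enumerate explicitly. The following sketch lists the grid and shows the effect of halving the sampling rate on sequence length. It assumes that downsampling simply keeps every second sample, which the text does not specify, and `validation_accuracy` is a hypothetical callback that would train a model and return its validation accuracy.

```python
from itertools import product

# Values fixed in the study.
FIXED = {"batch_size": 128, "hidden_size": 128, "dropout": 0.2, "epochs": 100}
# Values searched in the study.
LEARNING_RATES = (1e-3, 1e-4, 5e-5, 1e-5)
NUM_LAYERS = (2, 3, 4)


def build_grid():
    """Enumerate every (learning rate, number of layers) combination."""
    return [
        dict(FIXED, learning_rate=lr, num_layers=layers)
        for lr, layers in product(LEARNING_RATES, NUM_LAYERS)
    ]


def select_best(configs, validation_accuracy):
    """Pick the configuration with the highest validation accuracy.

    `validation_accuracy` is a hypothetical callable: config -> float.
    """
    return max(configs, key=validation_accuracy)


def downsample_by_two(samples):
    """Naive 500 Hz -> 250 Hz downsampling (assumption: keep every 2nd sample)."""
    return samples[::2]


configs = build_grid()
print(len(configs))  # 12

# One recording: 5000 time steps, each a 12-dimensional vector (one value per lead).
recording = [[0.0] * 12 for _ in range(5000)]
print(len(downsample_by_two(recording)))  # 2500
```

Each of the 12 configurations requires a full training run, so the selection step is where nearly all of the search cost lies.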

ResNet-50 Model and Machine Learning Algorithms
ResNet is a deep neural network architecture introduced in 2015. It was developed to address the issue of vanishing gradients that arises in deep networks. This problem is resolved by adding skip connections between the layers. A skip (shortcut) connection adds the input of a block of layers to the output of that block, so the block learns a residual mapping instead of the original mapping. ResNet has achieved state-of-the-art accuracy in a variety of computer vision tasks and has become one of the most popular architectures for image classification [29]. For this reason, we used a pretrained ResNet-50 model as a base learner. To fine-tune the ResNet-50 model, we used the Adam optimizer with β1 = 0.9 and β2 = 0.999. The experiments were conducted with three initial learning rates (1e-4, 5e-5, 1e-5). Of the three, 5e-5 was adopted because it gave the highest validation accuracy among the individual base learners. We fixed the mini-batch size at 32 and the number of epochs at 30. The ResNet-50 model was developed with the PyTorch framework [30]. The computer specifications used in the experiments are as follows: Intel Core i7-9700K 3.60 GHz CPU, 64 GB memory, and a 12 GB NVIDIA GeForce GTX 2080 Ti graphics card.
In this study, we considered four machine learning classifiers as the meta learner of the stacking ensemble: logistic regression, SVM, random forest, and XGBoost. We used the Scikit-learn library (https://scikit-learn.org/stable/index.html, accessed on 30 January 2023) to implement the logistic regression, SVM, and random forest classifiers, while XGBoost was implemented using the XGBoost Python Package (https://xgboost.ai/, accessed on 30 January 2023). Optimal hyperparameters for the meta learner were chosen by performing a grid search and evaluating accuracy on the validation set. The hyperparameters tuned using the grid search are described in Table 2. The code for training and evaluating the proposed multi-modal stacking ensemble model is available at: https://github.com/xodud5654/MMSE (accessed on 17 February 2023).
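The residual mapping that motivates ResNet, mentioned earlier in this section, can be stated in a few lines. This is a toy sketch on plain lists rather than the convolutional blocks of ResNet-50; `transform` stands in for the stacked layers inside a block and is purely illustrative.

```python
def residual_block(x, transform):
    """Return transform(x) + x, the output of a block with a skip connection.

    `transform` is a placeholder for the block's layers and must return a
    list of the same length as `x`.
    """
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_transform(x):
    """A block whose layers have learned nothing: F(x) = 0."""
    return [0.0] * len(x)


def halve(x):
    """An arbitrary example transform: F(x) = x / 2."""
    return [xi / 2 for xi in x]


inputs = [1.0, -2.0, 3.0]
print(residual_block(inputs, zero_transform))  # [1.0, -2.0, 3.0]
print(residual_block(inputs, halve))           # [1.5, -3.0, 4.5]
```

When the transform outputs zero, the block reduces to the identity, which is why adding residual blocks does not by itself make it harder for a deep network to reproduce what a shallower one computes.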

Results
We evaluated the individual base learners, the simple averaging ensemble, and the stacking ensemble methods on the publicly available Chapman University and Shaoxing People's Hospital dataset. The data were split into three parts: 80% for training, 10% for validation, and 10% for testing. As shown in Table 1, the classes are imbalanced. Therefore, we used weighted averaging instead of macro averaging when computing performance measures such as the area under the ROC curve (AUC), sensitivity, precision, and F1-score. Weighted averaging calculates a performance measure for each class and then takes a weighted mean, where the weight of each class is its number of samples relative to the total number of samples.
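The effect of weighted averaging is easiest to see on a small imbalanced example. The sketch below computes per-class sensitivity, precision, and F1-score and combines them with class-frequency weights, with macro averaging shown for comparison. The ten labels are toy data invented for illustration, not results from the study; only the class names are borrowed from Table 1.

```python
def averaged_scores(y_true, y_pred):
    """Return weighted and macro averages of sensitivity, precision, and F1."""
    labels = sorted(set(y_true))
    total = len(y_true)
    names = ("sensitivity", "precision", "f1")
    weighted = dict.fromkeys(names, 0.0)
    macro = dict.fromkeys(names, 0.0)
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        weight = (tp + fn) / total  # class share of all samples
        for name, value in zip(names, (sensitivity, precision, f1)):
            weighted[name] += weight * value
            macro[name] += value / len(labels)
    return weighted, macro


# Toy labels: 6 SB, 3 SR, 1 SI. One SB and the single SI are misread as SR.
y_true = ["SB"] * 6 + ["SR"] * 3 + ["SI"]
y_pred = ["SB"] * 5 + ["SR"] + ["SR"] * 3 + ["SR"]

weighted, macro = averaged_scores(y_true, y_pred)
print(round(weighted["sensitivity"], 3), round(macro["sensitivity"], 3))  # 0.8 0.611
```

With these weights, the weighted sensitivity equals the overall accuracy (8 of 10 correct here), which is consistent with the reported accuracy of 93.97% and sensitivity of 0.940 being numerically the same. Macro averaging would instead give the single missed SI sample one third of the total weight.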
In Table 3, the performance of ResNet-50 based on scalogram images is compared with the performance of ResNet-50 based on ECG grayscale images for each lead. For the single-modal ensemble methods, the single-modal stacking ensembles achieved better results than the single-modal simple averaging ensemble and the 12 individual base learners for both scalogram images and ECG grayscale images, as described in Table 4. For scalogram images, single-modal stacking ensembles with the four machine learning algorithms showed the following diagnostic performance: AUC (ranging from 0.993 to 0.995), accuracy (ranging from 92.34% to 93.01%), sensitivity (ranging from 0.923 to 0.930), precision (ranging from 0.915 to 0.925), and F1-score (ranging from 0.913 to 0.925). For ECG grayscale images, single-modal stacking ensembles with the four machine learning algorithms achieved the following: AUC (0.993), accuracy (ranging from 92.34% to 93.01%), sensitivity (ranging from 0.923 to 0.930), precision (ranging from 0.918 to 0.925), and F1-score (ranging from 0.917 to 0.924). The single-modal stacking ensembles therefore performed similarly on scalogram images and ECG grayscale images. However, random forest and XGBoost showed better results on scalogram images, and logistic regression showed better results on ECG grayscale images. For the multi-modal stacking ensemble method, the best accuracy (93.97%), sensitivity (0.940), precision (0.937), and F1-score (0.936) were obtained when logistic regression was used as the meta learner, as shown in Table 5. The best AUC (0.996) was obtained when XGBoost was used as the meta learner. Compared with LSTM, BiLSTM, individual base learners, and single-modal ensemble methods, the proposed multi-modal ensemble methods showed better diagnostic performance.
For comparison, Figure 4 shows the confusion matrices of two individual leads, a single-modal stacking ensemble with random forest for scalogram images, a single-modal stacking ensemble with logistic regression for ECG grayscale images, a multi-modal simple averaging ensemble, and a multi-modal stacking ensemble with logistic regression.

Discussion
In this study, we proposed a multi-modal stacking ensemble that combines information from two different modalities, scalogram images and ECG grayscale images. The ResNet-50 model was used as the individual base learner of the stacking ensemble, and one of four machine learning algorithms (logistic regression, SVM, random forest, or XGBoost) was used as the meta learner. Among the four algorithms, logistic regression exhibited the highest accuracy, sensitivity, precision, and F1-score, and XGBoost achieved the best AUC.
The proposed multi-modal stacking ensemble relies on the predictions obtained from both the ECG grayscale image and the scalogram image to generate final predictions. The ECG grayscale image provides cardiologists with information similar to a patient's ECG graph displayed on a monitor, while the scalogram image offers information about the time-frequency relationship of the ECG signals. In other words, the proposed model has the advantage of collecting multi-modal information potentially contained in the ECG grayscale image and the scalogram image, thereby enabling more accurate predictions of CVDs. From a practical perspective, the utilization of multi-modal information can be crucial for improving the accuracy of predictions in medical environments where accuracy is of utmost importance.
Many studies have applied ensemble algorithms to the healthcare field. Kang et al. improved the AUC by simply averaging the predictions from five CNN algorithms (ResNet-101, Xception, Inception-v3, InceptionResNet-v2, DenseNet-201) in classifying breast microcalcification in screening mammograms [31]. Abdar et al. introduced a two-layer nested ensemble method that employed stacking and voting as the classifier to distinguish benign breast tumors from malignant cancers. Their results indicated that the proposed ensemble algorithms achieved higher performance than single classifiers and most of the previous works [32]. Rao et al. proposed an ensemble model that integrates three CNNs (DenseNet-121, Inception-v3, and InceptionResNet-v2) in a novel way. The proposed ensemble model showed better performance than the traditional ensemble technique in predicting the recurrence of odontogenic keratocysts (OKCs) on a small chunk of biopsy [33].
There are various public ECG databases for arrhythmia classification: the MIT-BIH arrhythmia database, the CinC/Physionet Challenge 2017 database (CinC2017), the China Physiological Signal Challenge 2018 database (CPSC2018), the PTB-XL database, and the Chapman University and Shaoxing People's Hospital arrhythmia database [6][34][35][36][37]. Several researchers have used the same database that we analyzed, the Chapman University and Shaoxing People's Hospital arrhythmia database. Yildirim et al. constructed an efficient DNN model combining 1D CNN and LSTM and achieved a 92.24% accuracy [11]. Merdjanovska et al. adopted the CPSCWinnerNet model, the winning model of the 2018 China Physiological Signal Challenge, consisting of convolutional blocks, GRUs, and an attention layer. They achieved an accuracy of 94.00% [38]. Baygin et al. proposed a novel classification model that generated 16,384 multilevel features using a homeomorphically irreducible tree and maximum absolute pooling. The Chi2 feature selector was used to select the 1000 most informative features, which were subsequently classified using the SVM classifier. The model showed a 92.95% accuracy despite being a feature-based method rather than an end-to-end method [39]. Guan et al. presented a new approach called the hidden attention residual network (HA-ResNet) for the automated classification of arrhythmia. They converted the 1D ECG into three different images (Recurrence Plot, Gramian Angular Field, and Markov Transition Field) and used them as input images. The HA-ResNet algorithm achieved an F1-score of 0.876, a sensitivity of 0.882, and a precision of 0.876 [40]. Direct comparison with the studies mentioned above should be made with care because the test data differ. Nevertheless, our proposed multi-modal stacking ensemble achieved comparable performance.
Despite demonstrating reasonable performance, this study has some limitations. First, with the exception of LSTM and BiLSTM, the majority of the experiments covered in the study are based on 2D CNNs. We compared the proposed method with base learners and single-modal ensemble methods to show the effectiveness of the proposed multi-modal stacking ensemble. However, it would also be worthwhile to compare the proposed method with feature-based machine learning algorithms or 1D CNN models. The second limitation pertains to the dataset used in this study. The 12-lead ECG arrhythmia database collected by Chapman University and Shaoxing People's Hospital is severely imbalanced. As described in Table 1, the SB category has 3888 samples, while the SI category contains only 397 samples. To alleviate this problem, we evaluated the performance measures with weighted averaging instead of macro averaging. To address the issue further, one could consider using several large publicly available ECG datasets, such as the recently published PTB-XL [37]. Third, when constructing the stacking ensemble, only one 2D CNN algorithm, ResNet-50, was used as the base learner. The architecture of the proposed model should be optimized over a variety of combinations of deep learning and machine learning algorithms.

Conclusions
In this study, we proposed the use of a multi-modal stacking ensemble for the prediction of CVDs. The proposed method achieved superior performance compared to LSTM, BiLSTM, individual base learner, simple averaging ensemble, and single-modal stacking ensemble methods. These results suggest that a multi-modal stacking ensemble may be a promising approach for improving the accuracy of CVD prediction. Further research is needed to explore the use of multi-modal stacking ensemble methods with large ECG datasets and other combinations of 2D CNNs and machine learning algorithms.